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PARALLELIZATION OF A MULTIGRID INCOMPRESSIBLE VISCOUS CAVITY FLOW 

SOLVER USING OPENMP* 

KEVIN ROE AND PIYUSH MEHROTRA 


Abstract. We describe a multigrid scheme for solving the viscous incompressible driven cavity problem 
that has been parallelized using OpenMP. The incremental parallelization allowed by OpenMP was of great 
help during the parallelization process. Results show good parallel efficiencies for reasonable problem sizes 
on an SGI Origin 2000. Since OpenMP allowed us to specify the number of threads (and in turn processors) 
at runtime, we were able to improve performance when solving on smaller/coarser meshes. This was ac- 
complished by giving each processor a more reasonable amount of work rather than having many processors 
work on very small segments of the data (and thereby adding significant overhead). 

Key words, parallel computing, SGI Origin 2000, OpenMP, multigrid, viscous driven cavity 

Subject classification. Computer Science 

1. Introduction. Effective use of parallel machines requires easily maintainable and portable program- 
ming models that allow users to exploit parallelism in applications written in a standard high-level language. 
MPI provides portability, however it can be more difficult to maintain and is not a high-level programming 
model. High Performance Fortran (HPF) is portable and fairly easy to maintain. OpenMP is also portable 
on shared memory architectures and fairly easy to maintain, although it can only be used on shared memory 
machines [1. 2, 3]. Since HPF can be used on both shared and distributed memory machines, one would 
think that it would be the parallel programming model of choice. However, OpenMP has sorin' advantages 
over HPF if one is using a shared memory machine. Most notably, the OpenMP model allows users to 
incrementally parallelize their code, which can lx* invaluable when parallelizing large codes. 

\ iscous fluid flow inside a driven cavity is a popular test case for evaluating numerical techniques. We 
describe the parallelization of a multigrid scheme for solving the viscous incompressible driven cavity problem 
using OpenMP. Two-dimensional incompressible viscous driven cavity flows are computed using a loosely 
coupled, implicit, second-order centrally differenced scheme. Mesh sequencing and a three-level V-cycle 
multigrid solver with a simple Jacobi integration smoother are used in this study. Although a multi-level 
V-cycle would more likely appear in practice, the three-level V-cycle was used simply to test OpenMP’s 
performance. 

In this paper we explain the viscous driven cavity problem, the governing equations and the numerical 
methods used. We then discuss the strategy used to parallelize the code. Tests were conducted to determine 
the performance of OpenMP when an equal number of processors are used for each grid level, and to see if 
overhead could be removed by using less processors for solvjng on coarser grids. Finally we discuss possible 
future work involving additional tests for the multigrid algorithm when more than three grid levels are 
allowed, as well as possible performance comparisons between OpenMP, MPI and HPF. 

2. Problem Domain. A driven cavity is defined as a rectangular cavity (the cases in this paper are 
all done on square cavities) where a plate of infinite length moves across the top of the cavity from left to 

*This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NASl- 
9704C while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), 
NASA Langley Research Center, Mail Stop 1320, Hampton, Virginia 23681-2199. {kproe ,pm}Cicase . edu. 
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right with non-dimensional speed U (Figure 2.1). The domain is discretized on a two-dimensional structured 
Cartesian mesh. Although the code was written to allow for clustering of mesh points near the walls, the 
tests conducted were all done on meshes with uniform cell spacing. 



Fig. 2.1. Square cavity with an infinitely long plate moving across the top of the cavity 


3. Governing Equations. The non- conservative form of the governing equations for unsteady incom- 
pressible laminar viscous flow is as follows 


(3.1) 

(3.2) 

(3.3) 


u x -f Vy = 0, 

Ut + UU X + VUy + P x = V 2 U, 

v t + uv x + Wy + P x = — V 2 r, 


where R e is the Reynolds number. 

An alternative method to solving the governing equations can avoid the explicit appearance of pressure 
by using the vorticity and stream function as dependent variables in the two-dimensional case. The two- 
dimensional vorticity transport equation is as follows 


(3.4) 


C t 4" w Cx vQy ^ C? 

lip 


where £ is the vorticity defined by 


(3.5) C = u y -v x . 

The stream function in two dimensions is defined by 


(3.6) « = 4>y and v - -^x* 

Substituting this into (3.5) produces the Poisson equation for the stream function 

(3.7) VV = C- 

The governing equations (Equations 3.1-3. 7) are discretized using second-order expressions for spatial 
derivatives and a first-order implicit expression in time. The equations can be rewritten in two simplified 
forms: 


2 



(3.8) 

(3.9) 
where, 


ft + Uf X + Vfy - ^-V 2 f = -() 

-v 2 / = -g 


(3.10) 

(3.11) 

(3.12) 


/ = [it, v, <, P, y >] , 

.9 = [P x ,Py,0,gg,C] and 


1 


-gg = («! + Vy) t + u'i + 2 u y v x + r 2 - — V 2 (u x + Vy) . 


We create a loosely-coupled set of equations by evaluating / implicitly at the advanced time step, while 
the non-linear terms u, e, and g are lagged at the previous time step. The resulting discretization is as 
follows: 


(3.13) 


and. 


(3.14) 


/ - f" + « 


M 


2 Ax 


(/i+1 ” fi-\) + o~A~ 7. ~ fj-l) 
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a •) ( 4 1 
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2 Ay 

2/ + fi-i) (fj+\ “ 2/ + fj-t ) 


= -0 


(/,+! - 2/ + /,_,)- (f j+ i - 2/ + /,-_,) = -g' 


where / represents / n+1 . g ' is the discretized form of g, 


(3.15) 
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where, 


, (u x + v y ) , < +1 - «?_, , (u? +1 -«?_!)(«,?» -»?- 1) A (tf+1 -^t) 2 
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Also note that 


(3-17) 

(3.18) 


(u x + v y ) n+l = 0 and 


(lt X + Vy) 
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We can rearrange Equations 3.13 and 3.14 into 


(3-19) Aijfij + Ai+ijfi+ij + Ai-ijfi-ij + + Aij-ifij-i = B/j. 

When we are solving for u, v, and the coefficients are defined as: 


4 , At ^ A t 

Aij — 1 + 2— — - — — -b 2- 


R e Ax 2 RrAy 2 ' 


4 A/ At 

^I±1J = -^1-^7 ± 


= 


R e Ax 2 2 Ax 

At At 

dr — — and 


BeAj/ 2 2 A y 

Bij = r - At g'. 


When we are solving for P and i/'-N the coefficients are defined as: 


_ 2_ 2 

<5x 2 A?/ 2 7 




1 

Ax 2 ’ 

1 

Ay 2 7 




Now we can solve the linearized system of governing equations in the form of Af = b. A more in depth 
discussion on the solution of the viscous driven cavity problem can be seen in other references [5, 6, 7]. 

4. Numerical Methods. The flow solver approximately inverts a penta-diagonal linear system at 
each time step in an attempt to have the dependent variables reach a steady state condition. A Symmetric 
Gauss-Seidel (SGS) iteration is used to approximately invert Af = b. A V-cycle multigrid is used to more 
quickly damp low frequency errors from the solution; however, the V-cycle is currently limited to three grid 
levels (fine, medium, and coarse). In addition, grid sequencing is used initially to speed the convergence of 
the solution on the finest grid. A more detailed description of the multigrid algorithm can be seen in other 
references [8, 9]. 

5. Parallelization Strategy. Even though the multigrid code used here was initially written without 
any thought of being parallelized, only a few changes to the code were required prior to its parallelization. 
Initially, a Symmetric Gauss Seidel (SGS) algorithm was used to speed convergence; however, there were 
complications when attempting to parallelize this algorithm; a red-black version of this algorithm was found 
to be unstable [4]. Similarly, when either a regular or red-black Gauss Seidel algorithm was substituted in 
place of the SGS method, the solution became unstable. Thus, a simple Jacobi algorithm was inserted to 
replace the SGS algorithm for testing purposes. 

Aside from this, modifications to the code were only in the form of parallel directives specifying the 
number of processors to use in parallel sections, which loops to parallelize, and which variables should 
be considered shared or private. The code was incrementally parallelized, which aided tremendously in the 
conversion process for two reasons. First, we did not have to determine the data mapping for all the variables 
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in the entire code. Second, using OpenMP in contrast to MPI or HPF made it easier to efficiently parallelize 
sections of the code and determine bottlenecks. 

The next test involved using a different number of processors for each grid level. Although we assign all 
available processors when computing at the finest grid level, this may not yield a sufficient amount of work 
for each processor when computing at the coarse grid level. In fact, the synchronization overhead, when 
using all processors at a coarser grid level, will most likely degrade overall performance. Since it is in shared 
memory, we do not have to worry about the cost associated with reshaping the data if we wish to use less 
threads/processors (than at the fine grid level) in OpenMP, we can assign a different number of processors 
to each grid level in the multigrid algorithm to better optimize performance (Figure 5.1). This code was 
written so that we can specify tin? number of processors to use for each grid level in an input file. Thus the 
code can be run with a different number of processors per grid level without recompilation. 



Pig. 5.1. Exarnple: Employing a different number of processors for each mesh level 


6. Results. The viscous driven cavity flow solver was tested on the SGI Origin 2000 at NASA Ames 
Research Center. All timings were measured using clock calls within the code rather than profiling in order 
to remove as much overhead as possible; also, results w^ere the average of ten runs. Timings were started 
after initial I/O and stopped w r hen the solution had converged. The code was compiled with the MIPSpro 
7.30 f90 compiler using OpenMP directives. 

Results show r sufficient parallel efficiencies when reasonable problem sizes were tested on the SGI Origin 
2000. Table 6.1 and Figure 6.1 show r that when the coarse grid size is sufficiently large (a factor of 8 smaller 
than the fine grid) w^e are able to achieve good parallel efficiencies. When the fine grid is 64x64. the medium 
grid is only 32x32 and the coarse grid is only 16x16. Because there is little work to do for each processor, it 
is difficult to obtain much benefit from using multiple processors. When the fine grid is increased to 256x256 
(the medium grid is 128x128 and the coarse grid is 64x64), there is now r more work for each processor to 
do even wffien multiple processors are used. For the 256x256 grid, two processors yield an excellent parallel 
efficiency of 0.94 and then steadily drops off as more processors are assigned. Larger problems continue this 
trend obtaining a parallel efficiency of 0.96 for two processors in the 512x512 fine grid case. Also, notice 
that wffien the code was parallelized with OpenMP, it maintained good single node performance. This is an 
advantage over HPF since many current HPF compilers are still having difficulty obtaining good single node 
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Table 6.1 

Performance of driven cavity problem using OpenMP 


Processors 

Fine Grid Size 

Execution Time (sec) 

Parallel Efficiencies 

64x64 

128x128 

256x256 

512x512 

64x64 

128x128 

256x256 

512x512 

seq 

6.62 

32.60 

193.35 

4963.34 





1 

6.81 

33.10 

198.82 

4990.18 

0.97 

0.98 

0.97 

0.99 

2 

4.18 

18.52 

102.85 

2582.37 

0.79 

0.88 

0.94 

0.96 

4 

3.09 

11.13 

56.21 

1375.71 

0.54 

0.73 

0.86 

0.90 

8 

2.21 

8.41 

35.90 

817.85 

0.37 

0.48 

0.67 

0.76 

16 

1.97 

5.26 

23.59 

512.76 

0.21 

0.39 

0.51 

0.60 



Processors 


Fig. 6.1. Performance of driven cavity problem using OpenMP 


performance. 

Initially, we believed that the performance was lower than expected because of the Origin 2000 s memory 
placement policy [10, 11]. This is because the arrays were initialized serially; thus the arrays were assigned 
to the processor which performed the initialization rather than being distributed among all the processors 
which subsequently operated on the arrays. The array was then initialized in parallel and although this 
improved performance, it was to a limited degree. If only a few iterations were run, then the impact on 
performance would be significant. When the code was run until convergence (on the order of 1000-10000 
iterations depending on the test conditions) the impact on performance was small enough to be within the 
margin of error for our timings. 

OpenMP also enables the user to specify the number of threads (and, hence processors) for each multigrid 
level, if desired. This allowed for an optimal number of processors to be specified at each grid level. Table 6.2 
and Figure 6.2 show that execution time can be reduced when tuning the number of processors for each grid 
level. Although some benefit was obtained, it was not as major an improvement as originally hoped for. 

Plots of the solution to the viscous driven cavity for R e = 1000 using a 128x128 mesh can be seen 
in Figure 6.3. The streamline trace shows the main circulation zone close to the center of the cavity and 
two recirculation zones in the bottom corners of the cavity. The velocity vector plot also shows the main 
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Table 6.2 

Performance with a different number of processors set per grid level 


Processors 

Fine Grid Size 

Execution Time (sec) 

Parallel Efficiencies 

fine 

med 

coarse 

64x64 

128x128 

256x256 

512x512 

64x64 

128x128 

256x256 

512x512 

8 

8 

8 

2.21 

8.41 

35.90 

817.85 

0.37 

0.48 

0.67 

0.76 

8 

8 

4 

2.01 

7.90 

33.87 

810.10 

0.41 

0.52 

0.71 

0.76 

16 

16 

16 

1.97 

5.26 

23.59 

512.76 

0.21 

0.39 

0.51 

0.60 

16 

16 

8 

1.52 

4.71 

21.43 

498.96 

0.27 

0.43 

0.56 

0.62 


1.0 

0.9 

0.8 



0.7 

o 

| 0.6 

lo, 

o> 

£ 0.4 

03 

CL 

0.3 


0.2 


0.1 


• • 64x64 ! 

| r, 128x128 

( • 256x256 

> O 512x512 

64x64* 

128x128* 

1 256x256* 

: 512x512*1 







0.0 


Processors 


8 


16 


4 Using a different # of processors for the gnd levels 
£vt6 processors on fine and m-rfcijm grid level: 4/8 on coarse level 


Fig. 6.2. Performance wheti a different number of processors are specified for each rnultiqrid level 


circulation zone as well as the two recirculation zones, although they are more difficult to see because of the 
differences in magnitude. 

7. Conclusions/Future Work. The results show that OpenMP can achieve sufficient parallel per- 
formance with simple multigrid codes. The code was written such that the user can specify the number of 
processors for each grid level in the configuration file so that re-compilation is not necessary. It is conceivable 
that a routine could be incorporated into the code that would determine the optimal number of threads to 
use for each grid level at runtime. The next stage would be to remove the three level restriction and re-test, 
OpenMP s parallel performance. This would most likely show even more of a reduction in overhead when 
very coarse grids are used (relative to the fine grid). A direct comparison to MPI and HPF equivalent codes 
would also be desired, however it would be difficult to compare the use of a variable number of processors for 
each grid level since there would be implementation difficulties when using MPI or HPF (specifically with 
reshaping of the data at runtime which would be prohibitively expensive). 
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